
Conversation

@jackzhxng
Contributor

@jackzhxng jackzhxng commented Aug 25, 2025

Summary

Utilize `multimodal_runner.h` to run [Voxtral exported from Optimum Executorch](huggingface/optimum-executorch#126).

The runner takes in a `.pt` file of a preprocessed audio recording and feeds it to a C++ multimodal runner.
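The `--audio_path` file in the test plan below (`input_features.bin`) is the preprocessed feature tensor dumped as a raw binary buffer. As a minimal sketch of what reading it back on the C++ side could look like, assuming contiguous float32 data whose layout matches the exported audio encoder's expected input (the helper is illustrative, not the runner's actual code):

```cpp
#include <fstream>
#include <string>
#include <vector>

// Read a raw float32 buffer of preprocessed audio features from disk.
// Assumes the file holds contiguous float32 values whose shape matches
// what the exported Voxtral audio encoder expects.
std::vector<float> load_audio_features(const std::string& path) {
  std::ifstream file(path, std::ios::binary | std::ios::ate);
  const std::streamsize num_bytes = file.tellg();
  file.seekg(0, std::ios::beg);
  std::vector<float> features(num_bytes / sizeof(float));
  file.read(reinterpret_cast<char*>(features.data()), num_bytes);
  return features;
}
```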

Example output:

```
This audio is a casual and somewhat silly conversation between two speakers who seem to be discussing their tattoos. The speakers are engaging in a game where they ask each other what their tattoos say, but both repeatedly say "sweet" instead of the actual words. The speakers are aware of their mistake and try to correct it by asking the other what their tattoo says, but they still end up saying "sweet" again. The conversation ends with a speaker telling the other that their tattoo says "
PyTorchObserver {"prompt_tokens":1138,"generated_tokens":99,"model_load_start_ms":0,"model_load_end_ms":0,"inference_start_ms":1756159197436,"inference_end_ms":1756159222710,"prompt_eval_end_ms":1756159209605,"first_token_ms":1756159209605,"aggregate_sampling_time_ms":96,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:33.116291 executorch:stats.h:104]       Prompt Tokens: 1138    Generated Tokens: 99
I 00:00:33.116304 executorch:stats.h:110]       Model Load Time:                0.000000 (seconds)
I 00:00:33.116312 executorch:stats.h:117]       Total inference time:           25.274000 (seconds)              Rate:  3.917069 (tokens/second)
I 00:00:33.116320 executorch:stats.h:127]               Prompt evaluation:      12.169000 (seconds)              Rate:  93.516312 (tokens/second)
I 00:00:33.116327 executorch:stats.h:136]               Generated 99 tokens:    13.105000 (seconds)              Rate:  7.554369 (tokens/second)
I 00:00:33.116338 executorch:stats.h:147]       Time to first generated token:  12.169000 (seconds)
I 00:00:33.116344 executorch:stats.h:153]       Sampling time over 1237 tokens: 0.096000 (seconds)
```
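As a sanity check, the reported rates follow directly from the token counts and timings above; note that the overall rate counts only generated tokens over total inference time:

```
prompt eval: 1138 tokens / 12.169 s ≈ 93.52 tokens/s
decode:        99 tokens / 13.105 s ≈  7.55 tokens/s
overall:       99 tokens / 25.274 s ≈  3.92 tokens/s
```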

Test plan

Build and run:

```
# Build and install ExecuTorch
cmake --preset llm -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=cmake-out -DEXECUTORCH_ENABLE_LOGGING=ON && cmake --build cmake-out -j16 --target install --config Release

# Build and install Voxtral runner
cmake -DCMAKE_INSTALL_PREFIX=cmake-out -DBUILD_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -Bcmake-out/examples/models/voxtral examples/models/voxtral && cmake --build cmake-out/examples/models/voxtral -j16 --config Release

# Run Voxtral runner
./cmake-out/examples/models/voxtral/voxtral_runner --model_path ~/models/voxtral/voxtral_q8da4w_edm_qe4w_d_split_metadata_unsqueeze.pte --tokenizer_path ~/hf/models--mistralai--Voxtral-Mini-3B-2507/snapshots/3060fe34b35ba5d44202ce9ff3c097642914f8f3/tekken.json --prompt "What can you tell me about this audio?" --audio_path ~/models/voxtral/input_features.bin
```
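For reference: `--model_path` points at the exported `.pte` program, `--tokenizer_path` at the `tekken.json` from the Hugging Face model snapshot, and `--audio_path` at the raw preprocessed feature buffer.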

Stack from ghstack (oldest at bottom):

@pytorch-bot

pytorch-bot bot commented Aug 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13663

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit f9c1771 with merge base 99e6349:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Aug 25, 2025
jackzhxng added a commit that referenced this pull request Aug 26, 2025
ghstack-source-id: 0fbf7c4
Pull Request resolved: #13663
jackzhxng added a commit that referenced this pull request Aug 26, 2025
ghstack-source-id: c51dcf5
Pull Request resolved: #13663
@jackzhxng jackzhxng added the release notes: examples label Aug 26, 2025
@jackzhxng jackzhxng requested a review from mergennachin August 26, 2025 19:53
jackzhxng added a commit that referenced this pull request Aug 26, 2025
ghstack-source-id: 22066a6
Pull Request resolved: #13663
```cpp
    size_t bos_token_index,
    size_t eos_token_index) {
  runtime::runtime_init();
  auto tekken_tokenizer = std::make_unique<tokenizers::Tekken>();
```
Contributor

not sure I follow what this is doing

Contributor Author

The HF tokenizer can "load" the Tekken tokenizer since it's also a JSON file, which we don't want; the pattern I see here is that we keep loading different tokenizers until one works.
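For illustration, constructing the Tekken tokenizer explicitly avoids that probing pattern. A sketch of the idea, with the `load()` signature and error handling assumed rather than taken from the actual diff:

```cpp
// Construct the Tekken tokenizer directly instead of probing loaders in
// sequence: tekken.json also parses as generic JSON, so a fallback chain
// could pick the wrong tokenizer implementation for it.
auto tekken_tokenizer = std::make_unique<tokenizers::Tekken>();
// Assumed interface: load() takes the tokenizer path and returns a status
// that must be checked before the tokenizer is used.
if (tekken_tokenizer->load(tokenizer_path) != tokenizers::Error::Ok) {
  ET_LOG(Error, "Failed to load Tekken tokenizer at %s", tokenizer_path.c_str());
}
```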

```cpp
// Prepare inputs
std::vector<MultimodalInput> inputs;

// 1. Add start bos-related text inputs and modality start token.
```
Contributor

So this is how we run audio inputs with the multimodal runner?
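Broadly, yes: the runner assembles an ordered list of `MultimodalInput` segments (pre-audio text tokens, the audio features, then the user prompt) and hands the whole sequence to the multimodal runner. A hypothetical sketch of that pattern, where the `make_*` helper names, the special-token strings, and the `generate()` signature are assumptions rather than the verified API:

```cpp
// Prepare an ordered mix of text and audio segments for the runner.
std::vector<MultimodalInput> inputs;

// 1. BOS and the audio-modality start marker, as text (the token strings
//    here are illustrative, not the exact Voxtral chat template).
inputs.emplace_back(make_text_input("<s>[INST][BEGIN_AUDIO]"));

// 2. The preprocessed audio features, e.g. loaded from --audio_path.
inputs.emplace_back(make_audio_input(std::move(audio_features)));

// 3. The user prompt followed by the instruction-close marker.
inputs.emplace_back(make_text_input(prompt + "[/INST]"));

// Hand the full sequence to the multimodal runner: prefill over every
// segment in order, then autoregressive decoding.
runner->generate(std::move(inputs), config);
```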

@jackzhxng jackzhxng merged commit f9c1771 into gh/jackzhxng/31/base Sep 2, 2025
111 of 112 checks passed
@jackzhxng jackzhxng deleted the gh/jackzhxng/31/head branch September 2, 2025 02:54
jackzhxng added a commit that referenced this pull request Sep 3, 2025
(Messed up the merge for the original stack, this is a reland. Original PR with comments here - #13663)


Differential Revision: [D81498749](https://our.internmc.facebook.com/intern/diff/D81498749)

[ghstack-poisoned]
kirklandsign pushed a commit that referenced this pull request Sep 3, 2025
Pull Request resolved: #13663
jackzhxng added a commit that referenced this pull request Sep 4, 2025
(Messed up the merge for the original stack, this is a reland. Original PR with comments here - #13663)


Differential Revision: [D81498749](https://our.internmc.facebook.com/intern/diff/D81498749)

[ghstack-poisoned]